We will cover some functions of the YouTube Data API v3 from the Google Developers Console. There is an official Python client library for Google APIs, but we will access the API with plain HTTP requests instead.
In [73]:
api_key = ""
In [57]:
from __future__ import division
from datetime import datetime
import json
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from lxml import html, etree
from textblob import TextBlob

warnings.filterwarnings('ignore')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 35
pd.options.display.width = 120
https://developers.google.com/youtube/v3/docs/search
GET https://www.googleapis.com/youtube/v3/search
Parameter name | Value | Description
---|---|---
**Required parameters** | |
part | string | Specifies a comma-separated list of one or more search resource properties that the API response will include. Set the parameter value to snippet. The snippet part has a quota cost of 1 unit.
**Filters (specify 0 or 1 of the following parameters)** | |
forContentOwner | boolean | Can only be used in a properly authorized request. Note: intended exclusively for YouTube content partners. Restricts the search to resources owned by the content owner specified by the onBehalfOfContentOwner parameter. The user must be authenticated with a CMS account linked to the specified content owner, and onBehalfOfContentOwner must be provided.
forMine | boolean | Can only be used in a properly authorized request. Restricts the search to videos owned by the authenticated user. If set to true, the type parameter must also be set to video.
relatedToVideoId | string | Retrieves a list of videos related to the video that the parameter value identifies. The value must be a YouTube video ID and, when this parameter is used, type must be set to video.
**Optional parameters** | |
channelId | string | Indicates that the API response should only contain resources created by the channel.
channelType | string | Restricts a search to a particular type of channel.
eventType | string | Restricts a search to broadcast events. If specified, type must also be set to video.
location | string | In conjunction with locationRadius, defines a circular geographic area and restricts a search to videos whose metadata specifies a geographic location within that area. The value is a latitude/longitude pair, e.g. 37.42307,-122.08427. An error is returned if a request specifies location without also specifying locationRadius.
locationRadius | string | In conjunction with location, defines a circular geographic area. The value must be a floating-point number followed by a measurement unit: m, km, ft, or mi (e.g. 1500m, 5km, 10000ft, 0.75mi). Values larger than 1000 kilometers are not supported.
maxResults | unsigned integer | Maximum number of items to return in the result set. Acceptable values are 0 to 50, inclusive; the default is 5.
onBehalfOfContentOwner | string | Can only be used in a properly authorized request. Note: intended exclusively for YouTube content partners. Indicates that the request's authorization credentials identify a YouTube CMS user acting on behalf of the content owner specified in the parameter value, letting content owners authenticate once and access all of their video and channel data without providing credentials for each individual channel. The CMS account must be linked to the specified content owner.
order | string | The method used to order resources in the API response. The default is relevance.
pageToken | string | Identifies a specific page in the result set to return. In an API response, the nextPageToken and prevPageToken properties identify other pages that can be retrieved.
publishedAfter | datetime | Only return resources created after the specified time, an RFC 3339 date-time value (e.g. 1970-01-01T00:00:00Z).
publishedBefore | datetime | Only return resources created before the specified time, an RFC 3339 date-time value (e.g. 1970-01-01T00:00:00Z).
q | string | The query term to search for. Supports the Boolean NOT (-) and OR (\|) operators, e.g. boating\|sailing, or boating\|sailing -fishing to exclude "fishing". The pipe character must be URL-escaped (%7C) when sent in the request.
regionCode | string | Return search results for the specified country, given as an ISO 3166-1 alpha-2 country code.
safeSearch | string | Whether search results should include restricted content as well as standard content.
topicId | string | Only return resources associated with the specified topic; the value identifies a Freebase topic ID.
type | string | Restricts a search to a particular type of resource, given as a comma-separated list. The default is video,channel,playlist.
videoCaption | string | Filter results based on whether videos have captions. If specified, type must be set to video.
videoCategoryId | string | Filter results by video category. If specified, type must be set to video.
videoDefinition | string | Restrict a search to high definition (HD) or standard definition (SD) videos. HD videos play back in at least 720p, though higher resolutions like 1080p may also be available. If specified, type must be set to video.
videoDimension | string | Restrict a search to 2D or 3D videos. If specified, type must be set to video.
videoDuration | string | Filter results by video duration. If specified, type must be set to video.
videoEmbeddable | string | Restrict a search to videos that can be embedded in a web page. If specified, type must be set to video.
videoLicense | string | Only include videos with a particular license; uploaders can attach either the Creative Commons license or the standard YouTube license. If specified, type must be set to video.
videoSyndicated | string | Restrict a search to videos that can be played outside youtube.com. If specified, type must be set to video.
videoType | string | Restrict a search to a particular type of video. If specified, type must be set to video.
The parameters we will use:

- part: set to id (returns only resource ID data) or snippet (returns some basic metadata about the resource)
- channelId
- maxResults
- order
- pageToken
- publishedAfter
- publishedBefore
- q
- key
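As an offline sanity check (my addition, using Python 3's urllib.parse rather than the requests call the notebook makes), this sketch shows how such a parameter set is encoded into the request URL. Note the pipe in the q value becoming %7C, as the docs require:

```python
from urllib.parse import urlencode

# Hypothetical parameter set mirroring the search request (API key omitted).
params = {
    "part": "snippet",
    "maxResults": 5,
    "order": "date",
    "q": "boating|sailing -fishing",  # OR and NOT operators
    "type": "video",
}
query = urlencode(params)  # percent-encodes each value
url = "https://www.googleapis.com/youtube/v3/search?" + query
# The pipe character in q is URL-escaped to %7C.
print("%7C" in query)  # -> True
```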
In [3]:
parameters = {"part": "snippet",
"maxResults": 5,
"order": "date",
"pageToken": "",
"publishedAfter": "2008-08-04T00:00:00Z",
"publishedBefore": "2008-11-04T00:00:00Z",
"q": "",
"key": api_key,
"type": "video",
}
url = "https://www.googleapis.com/youtube/v3/search"
In [4]:
parameters["q"] = "Mark Udall"
page = requests.request(method="get", url=url, params=parameters)
j_results = json.loads(page.text)
print page.text
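The response is JSON with the shape sketched below; a stubbed, abridged stand-in (the IDs are invented, not real API output) shows how the video IDs used later in this notebook are pulled out:

```python
import json

# Abridged stand-in for a search.list response; structure per the API docs.
stub = '''{
  "nextPageToken": "CAUQAA",
  "items": [
    {"id": {"kind": "youtube#video", "videoId": "abc123"}},
    {"id": {"kind": "youtube#video", "videoId": "def456"}}
  ]
}'''
j = json.loads(stub)
# Each search result nests the video ID under item["id"]["videoId"].
video_ids = [item["id"]["videoId"] for item in j["items"]]
print(video_ids)  # -> ['abc123', 'def456']
```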
https://developers.google.com/youtube/v3/docs/videos/list
GET https://www.googleapis.com/youtube/v3/videos
Parameter name | Value | Description
---|---|---
**Required parameters** | |
part | string | Specifies a comma-separated list of one or more video resource properties that the API response will include. If the parameter identifies a property that contains child properties, the child properties are included in the response; for example, in a video resource the snippet property contains the channelId, title, description, tags, and categoryId properties, so setting part=snippet returns all of them. Each part has its own quota cost.
**Filters (specify exactly one of the following parameters)** | |
chart | string | Identifies the chart that you want to retrieve.
id | string | A comma-separated list of YouTube video ID(s) for the resource(s) being retrieved. In a video resource, the id property specifies the video's ID.
myRating | string | Can only be used in a properly authorized request. Set to like or dislike to return only videos liked or disliked by the authenticated user.
**Optional parameters** | |
maxResults | unsigned integer | Maximum number of items to return in the result set. Note: supported in conjunction with myRating, but not with id. Acceptable values are 1 to 50, inclusive; the default is 5.
onBehalfOfContentOwner | string | Can only be used in a properly authorized request. Note: intended exclusively for YouTube content partners. Indicates that the request's authorization credentials identify a YouTube CMS user acting on behalf of the content owner specified in the parameter value. The CMS account must be linked to the specified content owner.
pageToken | string | Identifies a specific page in the result set to return. In an API response, nextPageToken and prevPageToken identify other pages. Note: supported in conjunction with myRating, but not with id.
regionCode | string | Select a video chart available in the specified region; can only be used with the chart parameter. The value is an ISO 3166-1 alpha-2 country code.
videoCategoryId | string | Identifies the video category for which the chart should be retrieved; can only be used with the chart parameter. By default, charts are not restricted to a particular category. The default value is 0.
In [5]:
parameters = {"part": "statistics",
"id": "5Q98TvXjIZg",
"key": api_key,
}
url = "https://www.googleapis.com/youtube/v3/videos"
In [6]:
page = requests.request(method="get", url=url, params=parameters)
j_results = json.loads(page.text)
print page.text
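A stubbed videos.list response (abridged; the counts are invented) shows the one gotcha when reading statistics: the counts arrive as JSON strings and must be cast before any arithmetic:

```python
import json

# Abridged stand-in for a videos.list?part=statistics response.
stub = '''{
  "items": [
    {"id": "5Q98TvXjIZg",
     "statistics": {"viewCount": "1234", "likeCount": "56",
                    "dislikeCount": "7", "commentCount": "8",
                    "favoriteCount": "0"}}
  ]
}'''
stats = json.loads(stub)["items"][0]["statistics"]
# Counts are strings in the JSON; cast them to integers before summing.
views = int(stats["viewCount"])
print(views)  # -> 1234
```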
I'll check the correlation between US Senate election results and YouTube stats, starting with the 2014 Colorado Senate race: Cory Gardner (R) vs. Mark Udall (D).
In [7]:
def _search_list(q="", publishedAfter=None, publishedBefore=None, pageToken=""):
parameters = {"part": "id",
"maxResults": 50,
"order": "viewCount",
"pageToken": pageToken,
"q": q,
"type": "video",
"key": api_key,
}
url = "https://www.googleapis.com/youtube/v3/search"
if publishedAfter: parameters["publishedAfter"] = publishedAfter
if publishedBefore: parameters["publishedBefore"] = publishedBefore
page = requests.request(method="get", url=url, params=parameters)
return json.loads(page.text)
def search_list(q="", publishedAfter=None, publishedBefore=None, max_requests=10):
more_results = True
pageToken=""
results = []
for counter in range(max_requests):
j_results = _search_list(q=q, publishedAfter=publishedAfter, publishedBefore=publishedBefore, pageToken=pageToken)
items = j_results.get("items", None)
if items:
results += [item["id"]["videoId"] for item in j_results["items"]]
        if "nextPageToken" in j_results:
pageToken = j_results["nextPageToken"]
else:
return results
else:
return results
return results
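The token-following loop in search_list can be exercised offline by stubbing out the fetch step; fetch_page below is a made-up stand-in for _search_list, serving three fake pages:

```python
# Three fake pages of results; the last one has no nextPageToken.
PAGES = {
    "": {"items": [{"id": {"videoId": "v1"}}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": {"videoId": "v2"}}], "nextPageToken": "p3"},
    "p3": {"items": [{"id": {"videoId": "v3"}}]},
}

def fetch_page(token=""):
    # Stand-in for _search_list: look the "page" up by its token.
    return PAGES[token]

def collect_ids(max_requests=10):
    token, results = "", []
    for _ in range(max_requests):
        page = fetch_page(token)
        results += [item["id"]["videoId"] for item in page.get("items", [])]
        token = page.get("nextPageToken")
        if not token:  # no more pages to fetch
            break
    return results

print(collect_ids())  # -> ['v1', 'v2', 'v3']
```

max_requests caps quota usage the same way it does in search_list: with max_requests=2 only the first two pages are collected.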
def _video_list(video_id_list):
parameters = {"part": "statistics",
"id": ",".join(video_id_list),
"key": api_key,
"maxResults": 50
}
url = "https://www.googleapis.com/youtube/v3/videos"
page = requests.request(method="get", url=url, params=parameters)
j_results = json.loads(page.text)
df = pd.DataFrame([item["statistics"] for item in j_results["items"]], dtype=np.int64)
df["video_id"] = [item["id"] for item in j_results["items"]]
parameters["part"] = "snippet"
page = requests.request(method="get", url=url, params=parameters)
j_results = json.loads(page.text)
df["publishedAt"] = [item["snippet"]["publishedAt"] for item in j_results["items"]]
df["publishedAt"] = df["publishedAt"].apply(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%S.000Z"))
df["date"] = df["publishedAt"].apply(lambda x: x.date())
df["week"] = df["date"].apply(lambda x: x.isocalendar()[1])
df["channelId"] = [item["snippet"]["channelId"] for item in j_results["items"]]
df["title"] = [item["snippet"]["title"] for item in j_results["items"]]
df["description"] = [item["snippet"]["description"] for item in j_results["items"]]
df["channelTitle"] = [item["snippet"]["channelTitle"] for item in j_results["items"]]
df["categoryId"] = [item["snippet"]["categoryId"] for item in j_results["items"]]
return df
def video_list(video_id_list):
values = []
for index, item in enumerate(video_id_list[::50]):
t_index = index * 50
values.append(_video_list(video_id_list[t_index:t_index+50]))
return pd.concat(values)
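video_list batches IDs 50 at a time before calling _video_list. The enumerate/slice trick above is equivalent to this plainer chunking helper, shown here with dummy IDs:

```python
def chunks(seq, size=50):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

ids = ["vid%d" % i for i in range(120)]  # dummy video IDs
sizes = [len(batch) for batch in chunks(ids)]
print(sizes)  # -> [50, 50, 20]
```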
In [8]:
def get_data(candidates, publishedAfter, publishedBefore):
results_list = []
for q in candidates:
results = search_list(q=q,
publishedAfter=publishedAfter,
publishedBefore=publishedBefore,
max_requests=50)
stat_data_set = video_list(results)
stat_data_set["candidate_name"] = q
results_list.append(stat_data_set)
data_set = pd.concat(results_list)
return data_set
def get_2008_data(candidates):
return get_data(candidates, publishedAfter="2008-08-04T00:00:00Z", publishedBefore="2008-11-04T00:00:00Z")
def get_2010_data(candidates):
return get_data(candidates, publishedAfter="2010-08-04T00:00:00Z", publishedBefore="2010-11-04T00:00:00Z")
def get_2012_data(candidates):
return get_data(candidates, publishedAfter="2012-08-04T00:00:00Z", publishedBefore="2012-11-04T00:00:00Z")
def get_2014_data(candidates):
return get_data(candidates, publishedAfter="2014-08-04T00:00:00Z", publishedBefore="2014-11-04T00:00:00Z")
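The four wrappers differ only in the year, so a single helper (my addition, not in the original notebook) can build the RFC 3339 window for any election year:

```python
def election_window(year):
    # Aug 4 .. Nov 4 of the given year, formatted as the RFC 3339
    # timestamps that publishedAfter/publishedBefore expect.
    return ("%d-08-04T00:00:00Z" % year, "%d-11-04T00:00:00Z" % year)

after, before = election_window(2014)
print(after, before)  # -> 2014-08-04T00:00:00Z 2014-11-04T00:00:00Z
```

get_data(candidates, *election_window(2014)) would then reproduce get_2014_data(candidates).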
In [9]:
candidates = ["Cory Gardner", "Mark Udall"] # Cory Gardner (R), Mark Udall (D)*
colorado_2014_ds = get_2014_data(candidates)
pd.pivot_table(colorado_2014_ds, values=["commentCount", "favoriteCount", "dislikeCount", "likeCount", "viewCount"],
aggfunc='sum', rows="candidate_name")
Out[9]:
In [10]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = colorado_2014_ds[colorado_2014_ds["candidate_name"]==candidate]
by_date = cand["week"].value_counts()
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos Published")
plt.xlabel("Week")
plt.show()
In [11]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = colorado_2014_ds[colorado_2014_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["viewCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos viewCount")
plt.xlabel("Week")
plt.show()
In [12]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = colorado_2014_ds[colorado_2014_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["likeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos likeCount")
plt.xlabel("Week")
plt.show()
In [13]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = colorado_2014_ds[colorado_2014_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["dislikeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos dislikeCount")
plt.xlabel("Week")
plt.show()
In [14]:
candidates = ["George Allen", "Tim Kaine"] # George Allen (R), Tim Kaine (D), winner
va_2012_ds = get_2012_data(candidates)
pd.pivot_table(va_2012_ds, values=["commentCount", "favoriteCount", "dislikeCount", "likeCount", "viewCount"],
aggfunc='sum', rows="candidate_name")
Out[14]:
In [15]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = va_2012_ds[va_2012_ds["candidate_name"]==candidate]
by_date = cand["week"].value_counts()
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos Published")
plt.xlabel("Week")
plt.show()
In [16]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = va_2012_ds[va_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["viewCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos viewCount")
plt.xlabel("Week")
plt.show()
In [17]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = va_2012_ds[va_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["likeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos likeCount")
plt.xlabel("Week")
plt.show()
In [18]:
for candidate, color in zip(candidates, ["r", "b"]):
cand = va_2012_ds[va_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["dislikeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos dislikeCount")
plt.xlabel("Week")
plt.show()
In [19]:
candidates = ["Dean Heller", "Shelley Berkley"] # Dean Heller (R)*, winner; Shelley Berkley (D)
nv_2012_ds = get_2012_data(candidates)
print pd.pivot_table(nv_2012_ds, values=["commentCount", "favoriteCount", "dislikeCount", "likeCount", "viewCount"],
aggfunc='sum', rows="candidate_name")
for candidate, color in zip(candidates, ["r", "b"]):
cand = nv_2012_ds[nv_2012_ds["candidate_name"]==candidate]
by_date = cand["week"].value_counts()
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos Published")
plt.xlabel("Week")
plt.show()
for candidate, color in zip(candidates, ["r", "b"]):
cand = nv_2012_ds[nv_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["viewCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos viewCount")
plt.xlabel("Week")
plt.show()
for candidate, color in zip(candidates, ["r", "b"]):
cand = nv_2012_ds[nv_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["likeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos likeCount")
plt.xlabel("Week")
plt.show()
for candidate, color in zip(candidates, ["r", "b"]):
cand = nv_2012_ds[nv_2012_ds["candidate_name"]==candidate]
by_date = pd.pivot_table(cand, rows=["week"], values=["dislikeCount"], aggfunc="sum")
by_date = by_date.sort_index()
dates = by_date.index
plt.plot(dates, by_date.values, "-o", label=candidate, c=color, linewidth=2)
plt.legend(loc="best")
plt.ylabel("Videos dislikeCount")
plt.xlabel("Week")
plt.show()
In [20]:
url = "http://www.senate.gov/general/contact_information/senators_cfm.xml"
response = requests.get(url)
tree = etree.fromstring(response.content)
print tree
In [21]:
member_full = [member.xpath("member_full")[0].text for member in tree.xpath("//member")]
senators = pd.DataFrame(member_full, columns=["member_full"])
fields = ["last_name", "first_name", "party", "state", "address",
          "phone", "website", "bioguide_id", "class"]
for field in fields:
    senators[field] = [member.xpath(field)[0].text for member in tree.xpath("//member")]
senators
Out[21]:
In [22]:
by_party = senators["party"].value_counts()
by_party.sort(ascending=False)
print by_party
color_dict = {"D": "b",
"R": "r",
"I": "g"}
labels = ["%s: %s" % (by_party.index[index], value) for index, value in enumerate(by_party)]
colors = list(pd.Series(by_party.index).map(color_dict))
plt.figure()
plt.axis("equal")
plt.pie(by_party.values, labels=labels, colors=colors, shadow=True, explode=np.zeros(len(by_party)) + 0.04)
plt.show()
fig = plt.figure()
axes = fig.add_subplot(111)
axes.barh(range(len(by_party.index)), by_party.values, color=colors)
plt.box(on="off")
axes.axvline(x=50, color="black", alpha=0.7, linewidth=2)
axes.yaxis.set_ticks([item + 0.4 for item in range(len(by_party.index))])
axes.yaxis.set_ticklabels(by_party.index, minor=False)
plt.xlabel("$113^{th}$ Senate Seats Controlled by Party")
plt.show()
Class II senators are up for re-election.
In [23]:
class_2_senators = senators[senators["class"]=="Class II"]
by_party =class_2_senators["party"].value_counts()
by_party.sort(ascending=False)
print by_party
labels = ["%s: %s" % (by_party.index[index], value) for index, value in enumerate(by_party)]
colors = list(pd.Series(by_party.index).map(color_dict))
plt.figure()
plt.axis("equal")
plt.pie(by_party.values, labels=labels, colors=colors, shadow=True, explode=np.zeros(len(by_party)) + 0.04)
plt.show()
color_dict = {"D": "b",
"R": "r",
"I": "g"}
fig = plt.figure()
axes = fig.add_subplot(111)
axes.barh(range(len(by_party.index)), by_party.values, color=colors)
plt.box(on="off")
axes.yaxis.set_ticks([item + 0.4 for item in range(len(by_party.index))])
axes.yaxis.set_ticklabels(by_party.index, minor=False)
plt.xlabel("$113^{th}$ Senate Seats of $Class II$ Controlled by Party")
plt.show()
In [24]:
class_3_senators = senators[senators["class"]=="Class III"]
by_party =class_3_senators["party"].value_counts()
by_party.sort(ascending=False)
print by_party
labels = ["%s: %s" % (by_party.index[index], value) for index, value in enumerate(by_party)]
colors = list(pd.Series(by_party.index).map(color_dict))
plt.figure()
plt.axis("equal")
plt.pie(by_party.values, labels=labels, colors=colors, shadow=True, explode=np.zeros(len(by_party)) + 0.04)
plt.show()
color_dict = {"D": "b",
"R": "r",
"I": "g"}
fig = plt.figure()
axes = fig.add_subplot(111)
axes.barh(range(len(by_party.index)), by_party.values, color=colors)
plt.box(on="off")
axes.yaxis.set_ticks([item + 0.4 for item in range(len(by_party.index))])
axes.yaxis.set_ticklabels(by_party.index, minor=False)
plt.xlabel("$113^{th}$ Senate Seats of $Class III$ Controlled by Party")
plt.show()
In [25]:
class_1_senators = senators[senators["class"]=="Class I"]
by_party =class_1_senators["party"].value_counts()
by_party.sort(ascending=False)
print by_party
labels = ["%s: %s" % (by_party.index[index], value) for index, value in enumerate(by_party)]
colors = list(pd.Series(by_party.index).map(color_dict))
plt.figure()
plt.axis("equal")
plt.pie(by_party.values, labels=labels, colors=colors, shadow=True, explode=np.zeros(len(by_party)) + 0.04)
plt.show()
color_dict = {"D": "b",
"R": "r",
"I": "g"}
fig = plt.figure()
axes = fig.add_subplot(111)
axes.barh(range(len(by_party.index)), by_party.values, color=colors)
plt.box(on="off")
axes.yaxis.set_ticks([item + 0.4 for item in range(len(by_party.index))])
axes.yaxis.set_ticklabels(by_party.index, minor=False)
plt.xlabel("$113^{th}$ Senate Seats of $Class I$ Controlled by Party")
plt.show()
Start by listing all seats in $Class II$.
In [26]:
class_2_senators = senators[senators["class"]=="Class II"].sort("state")
class_2_senators
Out[26]:
In [63]:
url = "http://www.fec.gov/data/CandidateSummary.do?format=xml"
response = requests.get(url)
page = html.fromstring(response.content)
print response.text[:1000]
In [64]:
for item in page[:10]:
print item.tag
Notice that <can_sum> encapsulates each candidate's data.
In [65]:
for item in page.xpath("//can_sum")[0]:
print "<%s>%s</%s>" % (item.tag, str(item.text), item.tag)
In [66]:
cand_list = [cand for cand in page.xpath("//can_sum") if cand.xpath("can_off")[0].text=="S"]
lin_ima = [cand.xpath("lin_ima")[0].text for cand in cand_list]
len(lin_ima)
Out[66]:
In [67]:
senate_cadidate = pd.DataFrame(lin_ima, columns=["lin_ima"])
fields = ["can_id", "can_nam", "can_off", "can_off_sta", "can_par_aff",
          "can_inc_cha_ope_sea", "ind_ite_con", "ind_uni_con", "ind_con",
          "par_com_con", "oth_com_con", "can_con", "tot_con",
          "tra_fro_oth_aut_com", "can_loa", "oth_loa", "tot_loa",
          "off_to_ope_exp", "off_to_fun", "off_to_leg_acc", "oth_rec",
          "tot_rec", "ope_exp", "fun_dis", "exe_leg_acc_dis",
          "tra_to_oth_aut_com", "can_loa_rep", "oth_loa_rep", "tot_loa_rep",
          "ind_ref", "par_com_ref", "oth_com_ref", "tot_con_ref",
          "oth_dis", "tot_dis", "cas_on_han_beg_of_per",
          "cas_on_han_clo_of_per", "net_con", "net_ope_exp",
          "deb_owe_by_com", "deb_owe_to_com", "cov_sta_dat", "cov_end_dat"]
for field in fields:
    senate_cadidate[field] = [cand.xpath(field)[0].text for cand in cand_list]
senate_cadidate
Out[67]:
In [69]:
def get_state_data(candidates):
data_set = get_2014_data(candidates)
t_ds = pd.pivot_table(data_set, values=["commentCount", "favoriteCount", "dislikeCount", "likeCount", "viewCount"],
aggfunc='sum', rows="candidate_name")
t_ds["like_dislike_r"] = t_ds["likeCount"] / (t_ds["dislikeCount"] + t_ds["likeCount"])
t_ds["views_share"] = t_ds["viewCount"] / t_ds["viewCount"].sum()
t_ds["msgs_share"] = t_ds["commentCount"] / t_ds["commentCount"].sum()
t_ds["likes_share"] = t_ds["likeCount"] / t_ds["likeCount"].sum()
t_ds["dislikes_share"] = t_ds["dislikeCount"] / t_ds["dislikeCount"].sum()
print t_ds
return t_ds
def fix_name(val_name):
val_names = val_name.split(", ")
return "%s %s" % (val_names[1].split(" ")[0].capitalize(), val_names[0].capitalize())
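`fix_name` turns an FEC-style `"LAST, FIRST MIDDLE"` string into `"First Last"`, dropping any middle name. A standalone sketch with a hypothetical input (the name below is made up, not from the data set):

```python
def fix_name(val_name):
    # Split "LAST, FIRST MIDDLE" on the comma, keep only the first given name,
    # and re-order as "First Last" with normal capitalization.
    val_names = val_name.split(", ")
    return "%s %s" % (val_names[1].split(" ")[0].capitalize(), val_names[0].capitalize())

print(fix_name("SMITH, JOHN A"))  # -> John Smith
```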
In [79]:
values_list = []
for index, state in zip(class_2_senators.index, class_2_senators["state"]):
print "%s: %s" % (state,
class_2_senators["member_full"][index])
candidates = senate_cadidate[senate_cadidate["can_off_sta"]==state]
    candidates = candidates[~candidates["tot_rec"].isnull()]
candidates["tot_rec_num"] = candidates["tot_rec"].apply(lambda x: x[1:].replace(",","")).astype(np.float64)
top_candidates = candidates.sort("tot_rec_num", ascending=False)[:2][["can_nam",
"can_par_aff",
"can_inc_cha_ope_sea",
"tot_rec_num",
"can_off_sta"]]
top_candidates["full_name"] = [fix_name(name) for name in top_candidates.values[:,0]]
top_candidates = top_candidates.sort("full_name")
print top_candidates["full_name"]
try:
ds = get_state_data([fix_name(name) for name in top_candidates.values[:,0]])
ds["state"] = state
ds["party"] = top_candidates["can_par_aff"].values
ds["donations"] = top_candidates["tot_rec_num"].values
values_list.append(ds)
except:
print "NA"
sentate_2014 = pd.concat(values_list)
sentate_2014
Out[79]:
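The `tot_rec` totals arrive as currency strings, which the cell above converts with `x[1:].replace(",","")`. The same conversion as a standalone helper, with a made-up sample value:

```python
def parse_currency(value):
    # Strip the leading "$" and thousands separators, then cast to float.
    return float(value[1:].replace(",", ""))

print(parse_currency("$1,234,567.89"))  # -> 1234567.89
```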
In [94]:
class_2_senators["state"]
Out[94]:
In [97]:
x_column = "views_share"
y_column = "viewCount"
s_column = "donations"
color_dict = {"DEM": "b", "REP": "r", "IND":"g", "NPA": "g", "DFL": "g"}
plt.figure(figsize=(18,12))
for party in sentate_2014["party"].unique():
cands = sentate_2014[sentate_2014["party"]==party]
x = cands[x_column]
y = cands[y_column]
size = sentate_2014[sentate_2014["party"]==party][s_column] / 3000000
plt.scatter(x,y, s=(np.array(size)) * 1000, c=color_dict[party], alpha=0.5)
print plt.ylim()[1]
plt.vlines(0.5, ymin=1, ymax=plt.ylim()[1]*0.9)
projected_winners = sentate_2014[sentate_2014[x_column]>0.5]["party"].value_counts()
result_text = []
for item in sentate_2014.iterrows():
    plt.annotate(item[1]["state"], xy=(item[1][x_column], item[1][y_column]))
for item in sentate_2014[sentate_2014[x_column]>0.5].iterrows():
    result_text += ["%s: %s (%s) - %0.1f%%" % (item[1]["state"], item[0], item[1]["party"], item[1]["views_share"] * 100.)]
result_text = "\n".join(result_text)
projected_winners = "\n".join(["%s:%s" % (party, value) for party, value in zip(projected_winners.index, projected_winners.values)])
plt.annotate(projected_winners, xy=(.65,plt.ylim()[1]*0.8))
plt.annotate(result_text, xy=(.8, 1.5))
plt.xlabel(x_column)
plt.ylabel(y_column + " (Log Scale)")
plt.grid()
plt.yscale("log")
#plt.axis("tight")
plt.title("Senate 2014 Elections Forecast (Size is relative and represents the amount of donations)")
plt.show()
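The scatter loop above looks party codes up with `color_dict[party]`, which raises a `KeyError` the first time an unmapped party appears in the data. `dict.get` with a fallback color keeps the loop running; `"LIB"` below is a hypothetical unmapped code used only for illustration:

```python
color_dict = {"DEM": "b", "REP": "r", "IND": "g", "NPA": "g", "DFL": "g"}

# .get falls back to gray instead of raising KeyError for unseen parties.
print(color_dict.get("DEM", "gray"))  # -> b
print(color_dict.get("LIB", "gray"))  # -> gray
```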
In [58]:
sentate_2014[sentate_2014[x_column]>0.5]
Out[58]:
In [46]:
len(sentate_2014["state"].unique())
Out[46]:
In [35]:
def get_state_data(candidates):
data_set = get_2012_data(candidates)
t_ds = pd.pivot_table(data_set, values=["commentCount", "favoriteCount", "dislikeCount", "likeCount", "viewCount"],
aggfunc='sum', rows="candidate_name")
t_ds["like_dislike_r"] = t_ds["likeCount"] / (t_ds["dislikeCount"] + t_ds["likeCount"])
t_ds["views_share"] = t_ds["viewCount"] / t_ds["viewCount"].sum()
t_ds["msgs_share"] = t_ds["commentCount"] / t_ds["commentCount"].sum()
t_ds["likes_share"] = t_ds["likeCount"] / t_ds["likeCount"].sum()
t_ds["dislikes_share"] = t_ds["dislikeCount"] / t_ds["dislikeCount"].sum()
    # Sentiment analysis of the video titles
t_ds["sentiment"] = pd.Series()
for cand in candidates:
t_ds["sentiment"][cand] = np.mean(
[TextBlob(title).polarity for title in data_set[data_set["candidate_name"]==cand]["title"]]
)
print t_ds
return t_ds
In [36]:
senate_2012 = pd.read_csv("data/2012_senate_results.csv")
senate_2012["Full Name"] = senate_2012["First Name"] + " " + senate_2012["Last Name"]
senate_2012
Out[36]:
In [37]:
senate_2012["commentCount"] = pd.Series()
senate_2012["dislikeCount"] = pd.Series()
senate_2012["favoriteCount"] = pd.Series()
senate_2012["likeCount"] = pd.Series()
senate_2012["viewCount"] = pd.Series()
senate_2012["like_dislike_r"] = pd.Series()
senate_2012["views_share"] = pd.Series()
senate_2012["msgs_share"] = pd.Series()
senate_2012["likes_share"] = pd.Series()
senate_2012["dislikes_share"] = pd.Series()
senate_2012["sentiment"] = pd.Series()
for state in np.unique(senate_2012["State Postal"]):
print state + ":"
cands = senate_2012[senate_2012["State Postal"] == state]
top_cands = cands.sort("Vote Count",ascending=False)[:2]
#print top_cands
try:
youtube_stats = get_state_data(top_cands["Full Name"].values)
#print youtube_stats
# Store Data Back
for item in youtube_stats.iterrows():
cand = item[0]
stats = item[1]
            index = senate_2012[senate_2012["Full Name"] == cand].index[0]
senate_2012["commentCount"][index] = stats["commentCount"]
senate_2012["dislikeCount"][index] = stats["dislikeCount"]
senate_2012["favoriteCount"][index] = stats["favoriteCount"]
senate_2012["likeCount"][index] = stats["likeCount"]
senate_2012["viewCount"][index] = stats["viewCount"]
senate_2012["like_dislike_r"][index] = stats["like_dislike_r"]
senate_2012["views_share"][index] = stats["views_share"]
senate_2012["msgs_share"][index] = stats["msgs_share"]
senate_2012["likes_share"][index] = stats["likes_share"]
senate_2012["dislikes_share"][index] = stats["dislikes_share"]
senate_2012["sentiment"][index] = stats["sentiment"]
except:
pass
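The write-back loop above uses chained assignment (`senate_2012["commentCount"][index] = ...`), which modern pandas flags with a `SettingWithCopyWarning` and, under copy-on-write, may silently drop. A single `.loc` write is the safe equivalent; the frame and stats below are hypothetical:

```python
import pandas as pd

# Hypothetical result frame and per-candidate stats to write back.
df = pd.DataFrame({"Full Name": ["A", "B"], "viewCount": [float("nan")] * 2})
stats = {"viewCount": 1000.0}

# One .loc call with (row label, column label) replaces df["col"][index] = value.
idx = df.index[df["Full Name"] == "A"][0]
df.loc[idx, "viewCount"] = stats["viewCount"]
print(df)
```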
In [38]:
cands_with_stats = senate_2012[~senate_2012["viewCount"].isnull()]
cands_with_stats["VotesShare"] = cands_with_stats[["Vote Count", "State Postal"]].apply(
    lambda x: x[0] / senate_2012[senate_2012["State Postal"] == x[1]]["Vote Count"].sum(), axis=1)
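The per-row `apply` above re-filters the full frame once per candidate. A `groupby(...).transform("sum")` computes every state's total in one pass; the vote counts below are made up:

```python
import pandas as pd

# Made-up race results for two hypothetical states.
df = pd.DataFrame({
    "State Postal": ["MO", "MO", "VA"],
    "Vote Count":   [600, 400, 500],
})

# transform("sum") broadcasts each state's total back onto its rows,
# so the share is a single vectorized division.
df["VotesShare"] = df["Vote Count"] / df.groupby("State Postal")["Vote Count"].transform("sum")
print(df)
```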
In [39]:
x_col = "views_share"
y_col = "VotesShare"
plt.figure(figsize=(15,10))
color_dict = {"Dem": "b", "GOP": "r", "Ind":"g", "NPA": "orange"}
shape_dict = {"X": "*", "nan": "."}
wl_dp = [len(cands_with_stats[(cands_with_stats[x_col]>=0.5) &
(cands_with_stats["Winner"]=="X")]),
len(cands_with_stats[(cands_with_stats[x_col]>=0.5)])]
wl_dm = [len(cands_with_stats[(cands_with_stats[x_col]<0.5) &
(cands_with_stats["Winner"]=="X")]),
len(cands_with_stats[(cands_with_stats[x_col]<0.5)])]
wl_50p = "Winning Ratio %s/%s ($%0.1f \%%$)" % (wl_dp[0], wl_dp[1], wl_dp[0]/wl_dp[1]*100)
wl_50m = "Winning Ratio %s/%s ($%0.1f \%%$)" % (wl_dm[0], wl_dm[1], wl_dm[0]/wl_dm[1]*100)
for cand in cands_with_stats.iterrows():
stats = cand[1]
x = stats[x_col]
y = stats[y_col]
c = color_dict[stats["Party"]]
m = shape_dict[str(stats["Winner"])]
plt.scatter(x, y, c=c, marker=m, s=500, alpha=0.5)
if stats[x_col] > 0.9:
plt.annotate(stats["Full Name"],xytext=(8,20), xy=(x,y),
textcoords='offset points', arrowprops=dict(arrowstyle='-|>'))
plt.xlabel("YouTube " + x_col + " Between Competing Candidates in a State Race")
plt.ylabel("Actual " + y_col)
plt.vlines(.5, ymin=0, ymax=1)
plt.annotate(s=wl_50p, xy=(0.7, 1))
plt.annotate(s=wl_50m, xy=(0.2, 1))
plt.title("YouTube Video Views for Candidate from 2012-08-04 to 2012-11-04 and Actual Votes")
plt.annotate("Stars Represent Winning Candidates\nCircles Represent Losing Candidates", xy=(0.03, 0.85))
plt.annotate("Red: GOP\nBlue: Dem\nGreen: Ind\nOrange: NPA", xy=(0.03, 0.7))
plt.axis("tight")
plt.box(on="off")
plt.show()
In [40]:
cands_with_stats[cands_with_stats["State Postal"]=="MO"]
Out[40]: